Audit Framework · AI Agent Infrastructure
AI Agent Infrastructure Audit
A complete audit framework for AI agent systems — covering infrastructure, autonomous actions, audit trails, data governance, EU AI Act compliance, access controls, and incident response.
AUG 2026 · High-Risk Deadline
D1
AI AGENT INFRASTRUCTURE & SYSTEMS AUDIT
Systems
What makes this different from a standard IT audit: You are not just auditing servers — you are auditing an autonomous software stack that makes decisions and invokes real-world actions. Every component in the chain is a potential attack or failure surface.
AI agent stack — what the auditor must map
[U] User / Trigger: human input or automated event
[O] Orchestrator: LangChain / AutoGen / CrewAI
[L] LLM API: GPT-4 / Claude / Gemini
[T] Tool Layer: APIs, DBs, file system, web
[M] Memory: vector DB / context store
[R] Response: action taken or answer returned
Infrastructure audit — test procedure per component
Test 1
Orchestration layer review
- Framework version pinned?
- Dependency SCA scan run?
- Config stored in version control?
- No hardcoded prompts in source?
Test 2
LLM API connection security
- API keys stored in secrets manager?
- Model version pinned (no auto-upgrade)?
- Rate limits and spend caps enforced?
- DPA signed with LLM provider?
Test 3
Tool integration inventory
- Full list of tools agent can invoke?
- Each tool documented with purpose?
- Unused tools disabled?
- Any tool with write/delete access flagged?
Test 4
Memory & vector DB controls
- Access control on vector DB?
- PII not stored in context store?
- Retention policy on memory defined?
- Data poisoning prevention in place?
Test 5
Network segmentation
- Agent process isolated in own network segment?
- Egress filtering on outbound tool calls?
- Agent cannot reach internal admin systems?
Test 6
Supply chain & dependencies
- All packages locked in requirements file?
- SBOM (software bill of materials) exists?
- Known CVEs scanned and patched?
- No unpinned pip install in CI/CD?
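The "no unpinned install" check can be automated. A minimal sketch that flags requirements lines without an exact `==` pin (ranges such as `>=` and `~=` are treated as unpinned and flagged for review):

```python
import re

def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()   # drop comments and whitespace
        if not line:
            continue
        # An exact pin looks like 'package==1.2.3'; anything else is flagged.
        if not re.search(r"==\s*[\w.\-]+", line):
            unpinned.append(line)
    return unpinned

reqs = """
langchain==0.2.5
requests>=2.0     # range, not pinned
openai
"""
print(find_unpinned(reqs))   # → ['requests>=2.0', 'openai']
```

Run this against every requirements file in the repo and in CI images; an empty result is the evidence the checklist item asks for.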
Outputs →
Agent Infrastructure Map
Systems Inventory
Integration Risk Register
Supply Chain Assessment
D2
AI ACTIONS & TOOL USE AUDIT
AI Behaviour
The most novel audit domain. Traditional IT audits test whether humans followed procedures. Here you are testing whether an autonomous agent acted within permitted boundaries — without a human approving every step.
Tool permission matrix — what agents are allowed to invoke
| Tool / Action         | Read | Write | Delete | External Call | Financial |
|-----------------------|------|-------|--------|---------------|-----------|
| Database query        | ✓    | HITL  | ✗      | ✗             | ✗         |
| Send email / message  | ✓    | HITL  | ✗      | ✓             | ✗         |
| Payment / transaction | ✗    | ✗     | ✗      | ✗             | HITL      |
Key: ✓ Permitted autonomously | ✗ Blocked by policy | HITL = Human-in-the-loop approval required before execution
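A matrix like this should be enforced in code, not only stated in policy. A minimal sketch in Python with default-deny for any tool or action not explicitly listed (tool and action names are illustrative):

```python
# Permission matrix encoded as data the orchestrator consults
# before every tool call. Anything unlisted is denied.
ALLOW, DENY, HITL = "allow", "deny", "hitl"

PERMISSIONS = {
    "database_query": {"read": ALLOW, "write": HITL, "delete": DENY,
                       "external": DENY, "financial": DENY},
    "send_email":     {"read": ALLOW, "write": HITL, "delete": DENY,
                       "external": ALLOW, "financial": DENY},
    "payment":        {"read": DENY, "write": DENY, "delete": DENY,
                       "external": DENY, "financial": HITL},
}

def check_permission(tool: str, action: str) -> str:
    # Default-deny: unknown tools or actions are blocked outright.
    return PERMISSIONS.get(tool, {}).get(action, DENY)
```

The audit test is then mechanical: attempt each blocked cell and confirm the orchestrator refuses before the tool is ever invoked.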
AI actions audit — test procedure
Test 1
Tool boundary enforcement
- Are tool permissions enforced in code, not just policy?
- Can agent bypass permission by chaining tools?
- Test: attempt unauthorised action — is it blocked?
Test 2
Human-in-the-loop (HITL) checkpoints
- All destructive actions require HITL?
- HITL cannot be bypassed by prompt instruction?
- Timeout on pending HITL — agent stops, not proceeds?
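A HITL checkpoint should fail closed: if no human decides within the timeout, the agent stops rather than proceeds. A minimal sketch using a queue as the approval channel (the channel and timeout are illustrative):

```python
import queue

def hitl_gate(approvals: "queue.Queue[bool]", timeout_s: float = 300.0) -> bool:
    """Block until a human approves or rejects the pending action.

    On timeout the gate returns False (deny) — the agent must never
    treat silence as approval.
    """
    try:
        return approvals.get(timeout=timeout_s)
    except queue.Empty:
        return False
```

The audit test is the timeout path: trigger a destructive action, let the approval window lapse, and confirm the action was abandoned, not executed.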
Test 3
Prompt injection resistance
- Agent tested against indirect prompt injection?
- Web-retrieved content cannot override system prompt?
- Input sanitisation on all external content?
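One cheap check the auditor can run: scan retrieved content for instruction-like phrases before it enters the context window. The phrase list below is illustrative only — a heuristic like this is no substitute for structural defences such as separating instructions from data:

```python
import re

# Naive marker phrases for indirect prompt injection (illustrative).
INJECTION_MARKERS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
]

def flag_retrieved_content(text: str) -> list[str]:
    """Return the marker patterns found in a retrieved document,
    so it can be quarantined before reaching the agent's context."""
    lowered = text.lower()
    return [p for p in INJECTION_MARKERS if re.search(p, lowered)]
```

A non-empty result should block or quarantine the document; an empty result proves nothing, which is why the checklist also demands adversarial testing.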
Test 4
Multi-agent trust chains
- Sub-agents cannot exceed orchestrator permissions?
- Inter-agent communication authenticated?
- Rogue sub-agent cannot elevate privilege?
Test 5
Rate limiting & scope caps
- Max tool calls per session defined?
- Max tokens / cost per run capped?
- Runaway loop detection in place?
- Spend cap enforced at API gateway level?
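Scope caps are easiest to audit when they are enforced in one place. A minimal per-session budget sketch — the limits here are illustrative, not recommendations:

```python
class SessionBudget:
    """Hard caps per agent session; exceeding any cap aborts the run."""

    def __init__(self, max_tool_calls: int = 25, max_tokens: int = 50_000):
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.tool_calls = 0
        self.tokens = 0

    def charge(self, tokens: int = 0, tool_call: bool = False) -> None:
        """Record usage; raise as soon as any cap is breached."""
        self.tool_calls += int(tool_call)
        self.tokens += tokens
        if self.tool_calls > self.max_tool_calls or self.tokens > self.max_tokens:
            raise RuntimeError("session budget exceeded — aborting agent run")
```

An in-process cap like this should back up, not replace, the spend cap at the API gateway, so a compromised agent process cannot simply skip the check.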
Test 6
Output sanitisation
- Agent output filtered before downstream use?
- PII stripped from outputs?
- Code execution output sandboxed?
Outputs →
Tool Permission Matrix
Action Risk Register
HITL Gap Analysis
Injection Test Results
D3
AUDIT TRAIL & OBSERVABILITY
Traceability
The EU AI Act requires automatic logging of high-risk AI operations. Without a complete, tamper-evident audit trail, you cannot investigate incidents, prove compliance, or explain decisions to regulators.
Complete log chain — every hop must be captured
[1] Session Start: user ID, timestamp, trigger source
[2] Prompt Log: system prompt + user input hash
[3] LLM Decision: model, version, token count, response
[4] Tool Invocation: tool name, params, result, latency
[5] HITL Event: approver ID, decision, timestamp
[6] Final Output: action taken or response returned
Audit trail — test procedure
Test 1
Log completeness
- Every tool call has a matching log entry?
- No gaps between session start and end?
- Replay 5 sessions — do logs reconstruct fully?
Test 2
Tamper evidence
- Logs written to write-once / append-only store?
- Hash chain or WORM storage enforced?
- No admin can delete logs without dual approval?
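One way to make a log tamper-evident is a hash chain: each entry's hash covers the previous entry's hash, so any later edit or deletion breaks verification. A minimal in-memory sketch — a real deployment would write to an append-only or WORM store, not a Python list:

```python
import hashlib
import json

def append_entry(log: list[dict], event: dict) -> None:
    """Append an entry whose hash chains to the previous entry."""
    prev_hash = log[-1]["hash"] if log else "genesis"
    payload = json.dumps(event, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"event": event, "prev": prev_hash, "hash": entry_hash})

def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edit or deletion surfaces as a mismatch."""
    prev = "genesis"
    for entry in log:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

The audit test is direct: alter one historical entry and confirm verification fails from that point onward.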
Test 3
Retention compliance
- Retention period defined per data type?
- High-risk AI logs retained minimum required period?
- Automated deletion policy enforced at expiry?
Test 4
Searchability & incident response
- Logs indexed by session ID, user, tool, time?
- Can reconstruct full decision chain in <30 min?
- Tested during tabletop incident exercise?
Test 5
PII handling in logs
- PII in prompts masked / pseudonymised?
- Log access restricted to authorised personnel?
- Raw prompt content not stored in plain text?
Outputs →
Observability Gap Report
Log Architecture Diagram
Retention Compliance Check
Incident Replay Test Result
D4
DATA GOVERNANCE & PRIVACY
GDPR · Data
AI agents process personal data differently from traditional systems. Context windows, RAG pipelines, and memory stores create new data flows that standard GDPR assessments miss entirely.
AI-specific data flows — each requires a lawful basis and DPA
Data entering the agent
- User input (may contain PII)
- RAG retrieval from knowledge base
- Tool responses (CRM, DB, email data)
- Memory recall from prior sessions
- System prompt (may embed user profile)
Data processed by LLM
- Full context window sent to external LLM
- Fine-tuning datasets (if applicable)
- Embedding generation for vector storage
- Intermediate reasoning chains
- Every token sent = a data transfer
Data exiting the agent
- Agent outputs stored in logs
- Actions written to downstream systems
- Memory persisted to vector DB
- Reports or emails sent externally
- Embeddings retained indefinitely?
Test 1
LLM provider DPAs
- Signed DPA with OpenAI / Anthropic / Google?
- Data processing terms reviewed by legal?
- Provider certified for EU data processing?
Test 2
PII detection before LLM
- PII scanner on all inputs before API call?
- Names, emails, IDs masked or tokenised?
- Medical / financial data never sent to LLM?
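A pre-call PII gate can sit between the orchestrator and the LLM API. The two regexes below are illustrative only — a production system should use a dedicated PII detection library, not a pair of patterns:

```python
import re

# Illustrative patterns; real deployments need far broader coverage
# (names, national IDs, account numbers, medical terms, ...).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    """Replace detected PII with a typed placeholder before the API call."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact jane.doe@example.com"))   # → Contact [EMAIL]
```

The audit test: feed known PII through the full pipeline and confirm only the masked form appears in the outbound API request and in the logs.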
Test 3
RAG data governance
- RAG source documents classified by sensitivity?
- Access control on vector DB by user role?
- Deleted docs removed from embeddings too?
Test 4
Right to erasure (GDPR Art. 17)
- Can a user's data be fully deleted from logs?
- Embeddings containing PII can be removed?
- Test: submit erasure request — verify end-to-end
Test 5
Training data provenance
- Fine-tuning data has documented lawful basis?
- No customer data used in training without consent?
- Data lineage documented end-to-end?
Outputs →
AI Data Flow Map
GDPR Gap Report
Vendor DPA Register
Erasure Test Evidence
D5
EU AI ACT COMPLIANCE ASSESSMENT
Regulatory
First task: classify every AI agent by risk tier. The tier determines everything — obligations, deadlines, and whether third-party conformity assessment is required. Misclassification is itself a compliance failure.
Step 1 — classify each AI agent by risk tier
Banned
Prohibited Feb 2025
- Biometric surveillance in public
- Social scoring
- Subliminal manipulation
- Exploitation of vulnerable groups
Audit action
- Confirm not deployed
- Document classification decision
High Risk
Deadline: Aug 2026
- CV screening agents
- Credit risk agents
- Medical decision agents
- Fraud detection (financial)
- Critical infrastructure ops
Full compliance suite
- All 8 obligations apply
- Conformity assessment required
- EU AI database registration
Limited
Deadline: Aug 2026
- Customer service chatbots
- AI writing assistants
- Content generation agents
Transparency only
- Disclose AI interaction
- Label AI-generated content
Minimal
No mandatory obligations
- Internal knowledge agents
- Code assistants (internal)
- Summarisation tools
Voluntary best practice
- Document classification
- Apply voluntary code
Step 2 — test all 8 EU AI Act obligations for high-risk agents
1. Risk management system
Documented, iterative risk process covering full AI lifecycle
Risk register exists? Updated at each model change? Reviewed by accountable owner?
2. Data governance
Training data documented, bias-examined, quality-checked
Data lineage documented? Bias metrics computed and within thresholds? Signed off?
3. Technical documentation
Complete technical file covering capabilities, limitations, architecture
Technical file current? Version-controlled? Accessible to regulators within 72hr?
4. Automatic logging
AI system generates automatic logs of operations for traceability
Logs auto-generated? Capture all decisions? Retained for required period? Tamper-evident?
5. Transparency to users
Users informed they are interacting with an AI system
Disclosure present before first interaction? Clear and prominent? Not buried in T&Cs?
6. Human oversight
Human operators able to monitor, intervene, stop the system
Kill switch tested? Override mechanism documented? Oversight role assigned and trained?
7. Accuracy & robustness
Consistent performance, resilience to errors and adversarial inputs
Accuracy metrics documented? Adversarial testing done? Edge case behaviour defined?
8. Conformity assessment
Self-assessment or third-party audit + EU AI database registration
Assessment completed? Registered in EU AI database? CE marking applied if required?
Step 3 — establish provider vs deployer obligations split
PROVIDER — built the AI system
Obligations of the system builder
- Technical documentation and data governance
- Conformity assessment before market placement
- EU AI database registration
- CE marking on high-risk systems
- Post-market monitoring and incident reporting
DEPLOYER — uses the AI system in a specific context
Obligations of the user organisation
- Human oversight measures implemented
- Input data quality and relevance maintained
- Users informed of AI interaction
- Fundamental rights impact assessment (FRIA)
- Cannot deploy in ways that exceed intended purpose
Outputs →
Risk Tier Classification
EU AI Act Gap Analysis
Conformity File Readiness
Provider / Deployer Split
D6
ACCESS, IDENTITY & PRIVILEGE CONTROLS
Access
AI agents are identities. A service account that runs an AI agent must be treated with the same rigour as a privileged human user — potentially more, since it can act autonomously at machine speed.
Test 1
Agent service account review
- Every agent has a dedicated service account?
- Least privilege enforced on each account?
- No shared credentials across multiple agents?
- No human account used as agent identity?
Test 2
Secrets & API key management
- All API keys stored in vault (not env vars)?
- Key rotation enforced — max 90 day lifetime?
- No secrets in source code or config files?
- Secret scanning in CI/CD pipeline?
Test 3
Deployment pipeline access
- Who can deploy or modify agent config?
- MFA enforced on deployment pipeline?
- Change approval required before prod deployment?
Test 4
Agent-to-agent authentication
- Sub-agents must authenticate to orchestrator?
- No implicit trust between agent processes?
- Token-based auth with short expiry?
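Short-expiry token auth between agents can be sketched with stdlib HMAC. The shared secret and TTL here are illustrative — a real deployment would issue per-agent secrets from a vault rather than hard-coding one:

```python
import hashlib
import hmac
import time

SECRET = b"shared-orchestrator-secret"   # illustrative; fetch from a vault

def issue_token(agent_id: str, ttl_s: int = 60, now=None) -> str:
    """Mint a short-lived token a sub-agent presents to the orchestrator."""
    expiry = int((now if now is not None else time.time()) + ttl_s)
    msg = f"{agent_id}:{expiry}"
    sig = hmac.new(SECRET, msg.encode(), hashlib.sha256).hexdigest()
    return f"{msg}:{sig}"

def verify_token(token: str, now=None) -> bool:
    """Reject tampered or expired tokens; constant-time signature compare."""
    agent_id, expiry, sig = token.rsplit(":", 2)
    msg = f"{agent_id}:{expiry}"
    expected = hmac.new(SECRET, msg.encode(), hashlib.sha256).hexdigest()
    current = now if now is not None else time.time()
    return hmac.compare_digest(sig, expected) and current < int(expiry)
```

The audit tests map directly: an expired token must be rejected, and a token re-signed with a different agent ID must fail verification.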
Test 5
Periodic access review
- Agent service accounts reviewed quarterly?
- Unused agents decommissioned and deprovisioned?
- Access review evidence retained?
Outputs →
Agent Access Matrix
Secrets Management Review
Privilege Escalation Risk Report
D7
INCIDENT RESPONSE & FAILURE MODES
Risk
AI incidents are different from standard IT incidents. Hallucination, prompt injection, runaway loops, and cascading multi-agent failures require dedicated playbooks — and the EU AI Act requires serious incident reporting to regulators.
AI-specific failure modes the auditor must test
Hallucination
Confident incorrect output
- Detection: output validation layer?
- Containment: human review before high-stakes action?
- Is hallucination rate benchmarked and tracked?
Runaway Loop
Agent calls itself recursively
- Max iteration limit enforced in code?
- Token / cost cap as secondary kill?
- Loop detection alert fires within 60s?
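Runaway-loop detection can be as simple as a hard iteration cap plus a count of repeated identical tool calls. A minimal sketch — both thresholds are illustrative:

```python
from collections import Counter

class LoopGuard:
    """Abort when the agent loops: too many iterations overall,
    or the same tool call repeated too many times."""

    def __init__(self, max_iterations: int = 20, max_repeats: int = 3):
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self.iterations = 0
        self.seen = Counter()

    def record(self, tool: str, args: str) -> None:
        """Call before each tool invocation; raises to kill the run."""
        self.iterations += 1
        self.seen[(tool, args)] += 1
        if self.iterations > self.max_iterations:
            raise RuntimeError("iteration cap exceeded — aborting agent run")
        if self.seen[(tool, args)] > self.max_repeats:
            raise RuntimeError(f"runaway loop: {tool}({args}) repeated")
```

Pairing this with a token/cost cap gives the "secondary kill" the checklist asks for: either mechanism alone stops the run.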
Prompt Injection
Malicious instruction via input
- Indirect injection via retrieved content?
- System prompt cannot be overridden?
- Adversarial test suite run regularly?
Cascading Failure
Multi-agent chain collapse
- Failure in one agent isolated from others?
- Circuit breaker pattern implemented?
- Partial failure state handled gracefully?
Model Drift
LLM behaviour changes unexpectedly
- Model version pinned — no silent upgrades?
- Regression tests run on model update?
- Rollback procedure tested and documented?
Unauthorised Action
Agent acts outside permitted scope
- Kill switch tested and working?
- Alert on out-of-scope tool invocation?
- Rollback for side effects of actions?
Incident response — test procedure
Test 1
Kill switch & emergency stop
- Emergency stop tested in staging within last 90 days?
- Kill switch accessible to non-technical staff?
- Agent cannot restart itself after kill?
Test 2
AI incident classification
- AI incidents have their own severity taxonomy?
- Hallucination = classified as incident?
- Runaway loop = P1 incident automatically?
Test 3
EU AI Act serious incident reporting
- Serious incident definition per EU AI Act known?
- Regulator notification process documented?
- Can notify regulator within required timeframe?
Test 4
Rollback and remediation
- Model rollback tested — time to previous version?
- Side effects of agent actions can be reversed?
- Post-incident review process documented?
Test 5
Anomaly detection & alerting
- Alerting on unusual token consumption?
- Alert on unexpected tool call patterns?
- On-call rotation covers AI incidents 24/7?
Outputs →
Failure Mode Register
AI Incident Playbook
Kill Switch Test Evidence
Regulatory Notification Procedure
SC
COMPLIANCE READINESS SCORECARD
Executive Summary
Use this scorecard as your audit executive summary. One row per domain — each scored against five compliance dimensions. Present this to the board before the detailed findings report.
D1 — Infrastructure & Systems: PASS · GAP · GAP · PASS · GAP
D2 — AI Actions & Tool Use: GAP · FAIL · FAIL · GAP · FAIL
D3 — Audit Trail & Observability: PASS · PASS · GAP · PASS · GAP
D4 — Data Governance & Privacy: GAP · GAP · FAIL · GAP · FAIL
D5 — EU AI Act Assessment: GAP · FAIL · FAIL · FAIL · FAIL
D6 — Access & Identity: PASS · PASS · GAP · PASS · N/A
D7 — Incidents & Failures: GAP · GAP · FAIL · GAP · FAIL
Scorecard key: PASS = control exists and operating effectively | GAP = partial — improvement required | FAIL = control absent or not operating — immediate action required
Note: The scorecard above shows a representative example of findings for a typical early-stage AI deployment. Replace each cell with actual test results from your fieldwork. D5 and D4 most commonly produce FAIL ratings in first-time AI Act audits — organisations underestimate how new the GDPR and EU AI Act obligations are for AI-specific data flows.